159 research outputs found

    Predicting Genetic Regulatory Response Using Classification

    We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences ("motifs") in the gene's regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment ("parents"). Thus our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurement to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on held-out experiments. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks.
    Comment: 8 pages, 4 figures, presented at Twelfth International Conference on Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website: http://www.cs.columbia.edu/compbio/geneclas
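The regression-to-classification step described in this abstract can be illustrated with a minimal sketch. The function name, the threshold value, and the choice to mark within-noise measurements as `None` are all assumptions made for illustration; the abstract does not specify how the noise level is set or how such measurements are handled.

```python
def discretize_expression(log_ratios, noise_threshold=1.0):
    """Map real-valued expression measurements to +1 / -1 labels.

    Assumption: values whose magnitude does not exceed noise_threshold
    are treated as indistinguishable from noise and marked None.
    The default threshold of 1.0 is a placeholder, not a value from the paper.
    """
    labels = []
    for value in log_ratios:
        if value > noise_threshold:
            labels.append(1)       # up-regulated beyond the noise level
        elif value < -noise_threshold:
            labels.append(-1)      # down-regulated beyond the noise level
        else:
            labels.append(None)    # within noise: no label assigned
    return labels


labels = discretize_expression([2.0, -1.5, 0.3])
```

A classifier would then be trained only on the examples that received a +1 or -1 label.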

    A Flexible and Adaptive Framework for Abstention Under Class Imbalance

    In practical applications of machine learning, it is often desirable to identify and abstain on examples where the model's predictions are likely to be incorrect. Much of the prior work on this topic focused on out-of-distribution detection or performance metrics such as top-k accuracy. Comparatively little attention was given to metrics such as area-under-the-curve or Cohen's Kappa, which are extremely relevant for imbalanced datasets. Abstention strategies aimed at top-k accuracy can produce poor results on these metrics when applied to imbalanced datasets, even when all examples are in-distribution. We propose a framework to address this gap. Our framework leverages the insight that calibrated probability estimates can be used as a proxy for the true class labels, thereby allowing us to estimate the change in an arbitrary metric if an example were abstained on. Using this framework, we derive computationally efficient metric-specific abstention algorithms for optimizing the sensitivity at a target specificity level, the area under the ROC, and the weighted Cohen's Kappa. Because our method relies only on calibrated probability estimates, we further show that by leveraging recent work on domain adaptation under label shift, we can generalize to test-set distributions that may have a different class imbalance compared to the training set distribution. On various experiments involving medical imaging, natural language processing, computer vision and genomics, we demonstrate the effectiveness of our approach. Source code available at https://github.com/blindauth/abstention. Colab notebooks reproducing results available at https://github.com/blindauth/abstention_experiments
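The central idea in this abstract, using calibrated probabilities in place of the unknown true labels to estimate how a metric would change if an example were abstained on, can be sketched as follows. This is a simplified illustration and not the authors' algorithm: it estimates sensitivity at a fixed decision threshold rather than at a target specificity, and the function names and brute-force recomputation are assumptions made here.

```python
def expected_sensitivity(probs, threshold, keep):
    """Estimate sensitivity over the kept examples.

    Each calibrated probability p stands in for the true label: an example
    contributes p expected positives, and p expected true positives if it
    is predicted positive (p >= threshold).
    """
    expected_tp = sum(p for p, k in zip(probs, keep) if k and p >= threshold)
    expected_pos = sum(p for p, k in zip(probs, keep) if k)
    return expected_tp / expected_pos if expected_pos > 0 else 0.0


def abstention_gain(probs, threshold, index):
    """Estimated change in sensitivity if the example at `index` is abstained on."""
    keep_all = [True] * len(probs)
    keep_without = list(keep_all)
    keep_without[index] = False
    before = expected_sensitivity(probs, threshold, keep_all)
    after = expected_sensitivity(probs, threshold, keep_without)
    return after - before


probs = [0.9, 0.4, 0.1]
gains = [abstention_gain(probs, 0.5, i) for i in range(len(probs))]
```

In this toy example, abstaining on the uncertain example predicted negative (probability 0.4) raises the estimated sensitivity, while abstaining on the confident positive (0.9) lowers it.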

    Comparative analysis of metazoan chromatin organization

    Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal ‘arms’, and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.
    National Science Foundation (U.S.) (1122374

    Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design

    The efficient exploration of chemical space to design molecules with intended properties enables the accelerated discovery of drugs, materials, and catalysts, and is one of the most important outstanding challenges in chemistry. Encouraged by the recent surge in computer power and artificial intelligence development, many algorithms have been developed to tackle this problem. However, despite the emergence of many new approaches in recent years, comparatively little progress has been made in developing realistic benchmarks that reflect the complexity of molecular design for real-world applications. In this work, we develop a set of practical benchmark tasks relying on physical simulation of molecular systems mimicking real-life molecular design problems for materials, drugs, and chemical reactions. Additionally, we demonstrate the utility and ease of use of our new benchmark set by showing how to compare the performance of several well-established families of algorithms. Surprisingly, we find that model performance can strongly depend on the benchmark domain. We believe that our benchmark suite will help move the field towards more realistic molecular design benchmarks, and move the development of inverse molecular design algorithms closer to designing molecules that solve existing problems in both academia and industry.
    Comment: 29+21 pages, 6+19 figures, 6+2 table